In this project I performed an exploratory analysis of data provided by Ford GoBike, a bike-share system provider, using Python visualization techniques. The goal is to identify which variables have the most influence on the use of a bike-sharing service. The analysis covers data from 2019.
Columns:
Understand the number of subscribers and customers in the dataset and which of them are more valuable to the company.
The average time for which a bike is rented and the distance in miles travelled by different classes of users.
Understand whether the start time of rentals differs between subscribers and customers.
The locations that see the most rentals among different classes of users.
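The plots in this analysis rely on derived columns (`duration_mins`, `distance_miles`, `start_hour`, `start_day`, `start_month`) and on a `days` list whose construction is not shown here. The following is a minimal sketch of how they could be derived. It assumes the raw trip table has the columns `duration_sec`, `start_time`, and start/end station latitude and longitude, and it assumes the distance is the straight-line (haversine) distance between stations. The function name `add_derived_columns` is hypothetical, and the original notebook may have computed these columns differently.

```python
import numpy as np
import pandas as pd

# Assumed ordering used for the weekday axes (the `days` variable in the plots).
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
        'Friday', 'Saturday', 'Sunday']

EARTH_RADIUS_MILES = 3958.8


def add_derived_columns(trips: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `trips` with the columns the plots expect.

    Assumes raw columns: duration_sec, start_time,
    start_station_latitude/longitude, end_station_latitude/longitude.
    """
    out = trips.copy()
    start = pd.to_datetime(out['start_time'])

    out['duration_mins'] = out['duration_sec'] / 60
    out['start_hour'] = start.dt.hour
    out['start_day'] = start.dt.day_name()
    out['start_month'] = start.dt.month

    # Haversine distance between start and end stations (straight line,
    # not the route actually ridden, so round trips come out as 0).
    lat1 = np.radians(out['start_station_latitude'])
    lat2 = np.radians(out['end_station_latitude'])
    dlat = lat2 - lat1
    dlon = np.radians(out['end_station_longitude']
                      - out['start_station_longitude'])
    h = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    out['distance_miles'] = 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(h))
    return out
```

Because the distance is measured between stations, it understates how far a rider actually travelled, which matters when comparing subscribers and customers.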
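import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Image
# Rental duration on a linear x-axis (top) and on a log x-axis (bottom)
fig, ax = plt.subplots(nrows=2, figsize = [12, 6])
ax[0].hist(data = df, x='duration_mins', bins=np.arange(0, df.duration_mins.mean()+40, 1), color='b', alpha=0.8, edgecolor='k')
ax[1].hist(data = df, x = 'duration_mins', bins=10 ** np.arange(0, 2.0, 0.09), color='b', alpha=0.8, edgecolor='k')
ax[1].set_xscale('log')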
plt.sca(ax[0])
plt.xticks(np.arange(0, 55, 1))
plt.title("Rental duration in minutes - Normal Scale", fontsize=16)
plt.ylim(0, 180000)
plt.xlabel("Rental duration in minutes", fontsize=16)
plt.sca(ax[1])
plt.title("Rental duration in minutes - Log Scale", fontsize=16)
plt.xlabel("Rental duration in minutes - Log x scale", fontsize=16)
plt.tight_layout()
plt.show()
From the normal-scale plot: the distribution of rental duration is right-skewed, with some rentals lasting about an hour.
From the log-scale plot: rental duration is roughly bimodal.
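The log-scale histogram above uses bin edges of the form `10 ** np.arange(...)`, so the bins have equal width on a log axis rather than on a linear one. A small sketch of that idea follows; the helper name `log_bins` and its parameters are hypothetical and are not part of the original notebook.

```python
import numpy as np


def log_bins(low, high, step=0.1):
    """Bin edges evenly spaced in log10 between `low` and `high` (both > 0).

    Each edge is a constant multiple of the previous one, so the bins
    look equally wide once the x-axis is set to a log scale.
    """
    exponents = np.arange(np.log10(low), np.log10(high) + step, step)
    return 10 ** exponents


# Example: edges from 1 to 100 minutes, two bins per decade.
edges = log_bins(1, 100, step=0.5)
```

Such edges can be passed as `bins=` to a histogram whose axis is then switched to a log scale, which is what makes a second mode in the duration distribution easier to see.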
# Set the figure size before plotting so that it applies to this figure
sns.set(rc={'figure.figsize':(12, 6)})
sns.countplot(data = df, x = 'user_type', color='b')
plt.title("Number of Subscribers vs Customers", fontsize=16)
plt.xlabel("User Type", fontsize=16)
plt.show()
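# Look counts up by label rather than by position so the pie labels cannot be swapped
sub_count = df.user_type.value_counts()['Subscriber']
cust_count = df.user_type.value_counts()['Customer']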
total = sub_count+cust_count
plt.pie([sub_count/total*100, cust_count/total*100], autopct='%1.1f%%', labels = ['Subscribers', 'Customers'], startangle=90)
plt.title('Percentage of Subscribers vs Customers', fontsize=16)
plt.axis('equal')
plt.show()
# Distribution of rentals across days of the week, months and hour
fig, ax = plt.subplots(nrows=3, figsize = [14, 8])
sns.countplot(data = df, x = 'start_hour', color = 'b', alpha=0.8, ax = ax[0])
plt.sca(ax[0])
plt.xlabel("Start Hour", fontsize=16)
sns.countplot(data = df, x = 'start_day', color = 'b', alpha=0.8, ax = ax[1], order=days)
plt.sca(ax[1])
plt.xlabel("Start Day", fontsize=16)
sns.countplot(data = df, x = 'start_month', color = 'b', alpha=0.8, ax = ax[2])
plt.sca(ax[2])
plt.xlabel("Start Month", fontsize=16)
plt.tight_layout()
plt.show()
sns.countplot(x=df[df.user_type == 'Subscriber'].bike_share_for_all_trip)
plt.title("Subscribers in Bike Share for All scheme", fontsize=16)
plt.show()
# Look counts up by label so they match the 'No' and 'Yes' pie labels
no_count = df.bike_share_for_all_trip.value_counts()['No']
yes_count = df.bike_share_for_all_trip.value_counts()['Yes']
plt.pie([no_count/(no_count+yes_count)*100, yes_count/(no_count+yes_count)*100], autopct='%1.1f%%', labels=['No', 'Yes'], startangle=90)
plt.title("Percentage of Subscribers in Bike Share for All scheme", fontsize=15)
plt.axis('equal')
plt.show()
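The two pie charts that follow use the names `sub_dist`, `cust_dist`, `tot_dist`, `sub_dur`, `cust_dur`, and `tot_dur`, which are not defined anywhere in this section. One plausible way to compute them is sketched here, assuming they are sums of `distance_miles` and `duration_mins` per user type. That is an assumption (the original may have used a different aggregate), and the helper name `totals_by_user_type` is hypothetical.

```python
import pandas as pd


def totals_by_user_type(trips: pd.DataFrame, column: str):
    """Return (subscriber_total, customer_total, overall_total) for `column`.

    Assumes `trips` has a 'user_type' column with the values
    'Subscriber' and 'Customer'.
    """
    totals = trips.groupby('user_type')[column].sum()
    sub_total = float(totals.get('Subscriber', 0.0))
    cust_total = float(totals.get('Customer', 0.0))
    return sub_total, cust_total, sub_total + cust_total


# Toy data standing in for the real trip table.
toy = pd.DataFrame({
    'user_type': ['Subscriber', 'Subscriber', 'Customer'],
    'distance_miles': [1.0, 2.0, 3.0],
    'duration_mins': [5.0, 10.0, 30.0],
})
sub_dist, cust_dist, tot_dist = totals_by_user_type(toy, 'distance_miles')
sub_dur, cust_dur, tot_dur = totals_by_user_type(toy, 'duration_mins')
```

Totals mostly reflect how many trips each group makes, so a larger subscriber share here does not by itself mean that individual subscriber trips are longer.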
plt.pie([sub_dist/tot_dist*100, cust_dist/tot_dist*100], autopct='%1.1f%%', labels=['Subscribers', 'Customers'], startangle=90)
plt.title("Distance travelled", fontsize=16)
plt.axis('equal')
plt.show()
plt.pie([sub_dur/tot_dur*100, cust_dur/tot_dur*100], autopct='%1.1f%%', labels=['Subscribers', 'Customers'], startangle=90)
plt.axis('equal')
plt.title('Rental duration of Subscribers and Customers', fontsize=15)
plt.show()
fig, ax = plt.subplots(nrows=2, figsize = [12, 5])
sub = df[df.user_type=='Subscriber']
ax[0].hist(sub.duration_mins, bins=np.arange(0, 40, 1), color='b', alpha=0.8, edgecolor='k')
ax[1].hist(sub.duration_mins, bins=10 ** np.arange(0, 2, 0.08), color='b', alpha=0.8, edgecolor='k')
ax[1].set_xscale('log')
plt.sca(ax[0])
plt.xticks(np.arange(0, 40, 1))
plt.title("Rental duration for subscribers in minutes - Normal Scale", fontsize=16)
plt.xlabel("Rental duration for subscribers in minutes", fontsize=16)
plt.sca(ax[1])
plt.title("Rental duration for subscribers in minutes - Log Scale", fontsize=16)
plt.xlabel("Rental duration for subscribers in minutes", fontsize=16)
plt.tight_layout()
plt.show()
fig, ax = plt.subplots(nrows=2, figsize = [12, 5])
cust = df[df.user_type=='Customer']
ax[0].hist(cust.duration_mins, bins=np.arange(0, 40, 1), color='b', alpha=0.8, edgecolor='k')
ax[1].hist(cust.duration_mins, bins=10 ** np.arange(0, 2, 0.08), color='b', alpha=0.8, edgecolor='k')
ax[1].set_xscale('log')
plt.sca(ax[0])
plt.xticks(np.arange(0, 40, 1))
plt.yticks(np.arange(0, 18000, 4000))
plt.title("Rental duration for customers in minutes - Normal Scale", fontsize=16)
plt.xlabel("Rental duration for customers in minutes", fontsize=16)
plt.sca(ax[1])
plt.title("Rental duration for customers in minutes - Log Scale", fontsize=16)
plt.xlabel("Rental duration for customers in minutes", fontsize=16)
plt.tight_layout()
plt.show()
fig, ax = plt.subplots(nrows=2, figsize = [12, 5])
sub = df[df.bike_share_for_all_trip=='Yes']
ax[0].hist(sub.duration_mins, bins=np.arange(0, 40, 1), color='b', alpha=0.8, edgecolor='k')
plt.sca(ax[0])
plt.xticks(np.arange(0, 40, 1))
plt.title("Rental duration for Bike Share for All users in minutes - Normal Scale", fontsize=16)
plt.xlabel("Rental duration for Bike Share for All users in minutes", fontsize=16)
ax[1].hist(sub.duration_mins, bins=10 ** np.arange(0, 2, 0.08), color='b', alpha=0.8, edgecolor='k')
plt.sca(ax[1])
plt.xscale('log')
plt.title("Rental duration for Bike Share for All users in minutes - Log Scale", fontsize=16)
plt.xlabel("Rental duration for Bike Share for All users in minutes", fontsize=16)
plt.tight_layout()
plt.show()
fig, ax = plt.subplots(nrows=3, figsize = [20, 15])
sub = df[df.user_type=='Subscriber']
bins = np.arange(0, 3.5, 0.05)
ax[0].hist(sub.distance_miles, bins=bins, color='b', alpha=0.8, edgecolor='k')
plt.sca(ax[0])
plt.xticks(np.arange(0, 4, 0.1))
plt.title("Subscribers", fontsize=16)
plt.xlabel("Distance in miles", fontsize=16)
cust = df[df.user_type=='Customer']
ax[1].hist(cust.distance_miles, bins=bins, color='b', alpha=0.8, edgecolor='k')
plt.sca(ax[1])
plt.xticks(np.arange(0, 4, 0.1))
plt.title("Customers", fontsize=16)
plt.xlabel("Distance in miles", fontsize=16)
ax[2].hist(df[df.bike_share_for_all_trip == 'Yes'].distance_miles, bins=bins, color='b', alpha=0.8, edgecolor='k')
plt.sca(ax[2])
plt.xticks(np.arange(0, 4, 0.1))
plt.title("Bike Share", fontsize=16)
plt.xlabel("Distance in miles", fontsize=16)
plt.tight_layout()
plt.show()
data = df[df.duration_mins < 20]
sns.catplot(data=data, y='duration_mins', col="user_type", kind='box')
sns.catplot(data=data, y='duration_mins', col="user_type", kind='violin')
plt.suptitle("Duration in minutes of Rentals for Subscribers and Customers", y=1.1)
plt.show()
a = df[df.user_type == 'Customer'].start_station_name.value_counts()[:40]
# Set the figure size first, then pass x and y as keyword arguments
sns.set(rc={'figure.figsize':(16, 10)})
sns.barplot(x=a.values, y=a.index, palette='autumn')
plt.yticks(fontsize=16)
plt.tight_layout()
plt.show()
a = df[df.user_type == 'Subscriber'].start_station_name.value_counts()[:40]
# Set the figure size first, then pass x and y as keyword arguments
sns.set(rc={'figure.figsize':(16, 10)})
sns.barplot(x=a.values, y=a.index, palette='autumn')
plt.yticks(fontsize=16)
plt.tight_layout()
plt.show()
a = df[df.bike_share_for_all_trip == "Yes"].start_station_name.value_counts()[:40]
# Set the figure size first, then pass x and y as keyword arguments
sns.set(rc={'figure.figsize':(16, 10)})
sns.barplot(x=a.values, y=a.index, palette='autumn')
plt.yticks(fontsize=16)
plt.tight_layout()
plt.show()
a = sns.FacetGrid(df[df.user_type=='Customer'], col='start_day', col_wrap=5, col_order=days, sharey=True)
# Bin edges run to 24 so that hour 23 gets its own bin; the x-axis is the start hour and the y-axis is the count
a = (a.map(plt.hist, 'start_hour', bins=np.arange(0, 25, 1), color='b', alpha=0.8, edgecolor='k').set_axis_labels("Start Hour", "Number of rentals"))
a.set(xticks=np.arange(0, 24, 2))
plt.suptitle("Start Hour vs Day of the Week for Customers", y=1.1)
plt.show()
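# Count rentals per (hour, weekday); pivot takes keyword arguments in recent pandas versions
a = df[df.user_type=='Subscriber'].groupby(['start_hour', 'start_day'])['bike_id'].size().reset_index().pivot(index='start_hour', columns='start_day', values='bike_id')[days]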
heat_map = sns.heatmap(a, cmap = 'viridis_r', annot=True, robust=True, fmt='d', linewidths=.5)
plt.title('Subscriber', y=1)
plt.xlabel('Weekday', labelpad = 18)
plt.ylabel('Start Time Hour', labelpad = 18)
plt.show()
Image(filename=map_cust_img)
Image(filename=map_subs_img)
Image(filename=map_bikeshare_img)
Subscribers rented bikes for less time than customers but covered more distance, so they seem to be in a hurry.
The relationship between start day, start hour, and user type gives a hint about the nature of each user type.
Subscribers use bikes as a transportation option.
Subscribers are regular riders who travel to and from work or school, renting a bike at 7-9am and 4-6pm on weekdays.
Customers could be tourists who use bikes to explore the Bay Area, mainly on weekends.
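The commuting interpretation can be checked numerically as well as visually. The sketch below computes, for each user type, the share of rentals that start on a weekday during the 7-9am or 4-6pm windows. It assumes `start_day` holds English weekday names and `start_hour` holds integers from 0 to 23. The function name `commute_share` is hypothetical, and the toy data is illustrative only, not a result from the dataset.

```python
import pandas as pd

WEEKDAYS = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']
# Start hours 7 and 8 cover 7-9am; 16 and 17 cover 4-6pm.
PEAK_HOURS = [7, 8, 16, 17]


def commute_share(trips: pd.DataFrame) -> pd.Series:
    """Fraction of each user type's rentals that start in weekday peak hours."""
    is_commute = (trips['start_day'].isin(WEEKDAYS)
                  & trips['start_hour'].isin(PEAK_HOURS))
    return is_commute.astype(float).groupby(trips['user_type']).mean()


# Toy data standing in for the real trip table.
toy = pd.DataFrame({
    'user_type': ['Subscriber'] * 4 + ['Customer'] * 2,
    'start_day': ['Monday', 'Tuesday', 'Saturday', 'Wednesday',
                  'Saturday', 'Sunday'],
    'start_hour': [8, 17, 8, 12, 13, 8],
})
shares = commute_share(toy)
```

On the real data, a clearly higher share for subscribers than for customers would support the reading that subscribers are mostly commuters.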